# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Thu Nov 25 11:49:11 2021"
I’m really excited to start this course. I expect to learn a lot about using R and RStudio.
Here is a link to my GitHub repository: https://github.com/Kuukkelipyy/IODS-project
date()
## [1] "Thu Nov 25 11:49:11 2021"
The second week in short:
The original data was collected in 2014/2015 from students participating in the course Introduction to Social Statistics. For more information about the data and its variables, see the data description.
# read data from local file
learning2014 <- read.table("data/learning2014.csv")
# examine the structure and dimensions of the data
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(learning2014)
## [1] 166 7
After data wrangling, the data set utilized here consists of 166 observations of 7 variables (i.e. 166 rows and 7 columns), meaning that the data provides information from 166 respondents on seven variables. Deep, stra and surf are combination variables based on multiple questions measuring deep, strategic and surface learning (for more information about the variables see here). My code for the preparation of the data can be found in my repository. Next, I’ll present an overview of the data and its variables.
# summary of the variables in the data
summary(learning2014)
## gender age attitude deep
## Length:166 Min. :17.00 Min. :14.00 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:3.333
## Mode :character Median :22.00 Median :32.00 Median :3.667
## Mean :25.51 Mean :31.43 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:4.083
## Max. :55.00 Max. :50.00 Max. :4.917
## stra surf points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
# show summary of the gender variable when it is transformed to factor
summary(as.factor(learning2014$gender))
## F M
## 110 56
# access to libraries ggplot2 and GGally
## if not installed yet, use: install.packages("name_of_package")
library(GGally)
library(ggplot2)
# scatter plot matrix using the ggplot2 and GGally packages
ggpairs(learning2014, lower = list(combo = wrap("facethist", bins = 20)))
## note: if the scatter plot matrix is drawn with the base R function pairs(), the gender variable needs to be transformed to a factor (in the DataCamp exercise gender is a factor, not a character)
# pairs(learning2014[-1], col = as.factor(learning2014$gender))
The summary tables above show a summary of each variable in the data. Gender is a categorical (‘character’) variable, while the others are continuous (‘numerical’). For the numerical variables the summary provides the minimum, maximum, mean and median values as well as the lower (1st) and upper (3rd) quartiles of their distributions. As the gender variable is a ‘character’, the summary shows only the total number of observations. When the variable is transformed to a factor, the summary function shows the numbers of female and male respondents.
The distributions of all the variables are presented graphically in the scatter plot matrix. The scatter plot matrix also provides statistics on the correlations between the variables. Here I focus only on the last column of the matrix, which shows the correlations between ‘points’ and the other continuous variables: the highest correlation coefficient is found between points and attitude (0.437), followed by stra (0.146) and surf (-0.144), while the correlations between points and age (-0.093) and points and deep (-0.010) are very low. According to the scatter plot matrix, only the correlation between points and attitude is statistically significant (p < 0.001). It is also worth noting that there are statistically significant correlations between surf and attitude, surf and deep, and surf and stra, indicating a potential problem of multicollinearity if those variables were used simultaneously as explanatory variables in a linear regression model.
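Any single pair can also be checked numerically; for example, assuming the learning2014 data frame read in above, the correlation between points and attitude and its significance can be tested with `cor.test()`:

```r
# Pearson correlation between points and attitude, with a significance test
# (assumes the learning2014 data frame from the earlier chunk)
cor.test(learning2014$points, learning2014$attitude)
```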
In the following linear regression model ‘points’ is the response variable and, on the grounds of the correlation coefficients, attitude, stra and surf are selected as explanatory variables.
model1 <- lm(points ~ surf + stra + attitude, data = learning2014)
summary(model1)
##
## Call:
## lm(formula = points ~ surf + stra + attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## surf -0.58607 0.80138 -0.731 0.46563
## stra 0.85313 0.54159 1.575 0.11716
## attitude 0.33952 0.05741 5.913 1.93e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The results of the multiple linear regression model are shown in the summary table above. The F-statistic with a low p-value (p < 0.001) indicates that not all the regression coefficients in the model are zero, meaning that at least one of the variables included in the model has a statistical effect on the target variable (points). The multiple R-squared illustrates how well the model fits the data; here it (0.2074) shows that the model (i.e. the explanatory variables) explains around 21 % of the variation in the response variable. The summary of residuals (i.e. the differences between observed and predicted response values) provides information about the symmetry of the residual distribution. Residuals are discussed further in Chapter 2.4 along with model diagnostics.
A regression coefficient describes the relationship between an explanatory variable and the response variable: it gives the change in the response variable when the explanatory variable changes by one unit and the other variables stay constant. According to the coefficient table, the estimates of surf and stra are not statistically significant. In contrast, the estimate of attitude is statistically significant with a very low p-value (p < 0.001), showing that when attitude increases by one unit, points increase by 0.3395.
Based on the results of the multiple linear regression model presented above, I modify the model by excluding the variables whose regression coefficients are not statistically significant (p > 0.05). That is, surf and stra are dropped and only attitude is kept in the model.
model2 <- lm(points ~ attitude, data = learning2014)
summary(model2)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The results of the simple linear regression analysis are similar to the results of the multiple linear model discussed above. The multiple R-squared decreases a little, showing that the new model explains about 19 % of the variation in the response variable. The value of the multiple R-squared typically decreases when variables are removed from the model (and increases when more variables are included), while the adjusted R-squared accounts for the number of explanatory variables in the model. The adjusted R-squared is practically the same in both models, around 19 %.
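The adjusted R-squared values can be read directly from the model objects; a small check, assuming model1 and model2 fitted in the chunks above:

```r
# adjusted R-squared of the two models (assumes model1 and model2 exist)
summary(model1)$adj.r.squared  # ~0.193
summary(model2)$adj.r.squared  # ~0.186
```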
The coefficient table includes two estimates: intercept and attitude. The estimate of the intercept term can be interpreted here as the expected value of the response variable (points) when the explanatory variable (attitude) is 0. This information could be utilized, for instance, if we want to calculate/predict the points based on known values of the explanatory variable. For example, if a person’s attitude is 25, we can predict their points as follows: 11.63715 + 0.35255 x 25 = 20.4509 (i.e. intercept + regression coefficient of attitude x value of attitude = points).
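The arithmetic of this worked example can be verified directly in R using only the coefficients reported in the summary above (with the fitted model object, `predict(model2, newdata = data.frame(attitude = 25))` would give the same value):

```r
# predicted points for attitude = 25, using the reported coefficients of model2
intercept  <- 11.63715
b_attitude <- 0.35255
attitude   <- 25
points_hat <- intercept + b_attitude * attitude
points_hat  # 20.4509
```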
The estimate of attitude describes the relationship between attitude and points; the statistically significant (p < 0.001) regression coefficient of attitude (0.3526) shows that when attitude increases by one, the points increase by 0.3526 points.
The key assumptions of the linear regression model examined here are: a linear relationship between the explanatory and response variables, constant variance of the errors, and normally distributed errors. In addition, no single observation should have an unusually large influence on the model.
The validity of these assumptions can be explored by analyzing the residuals. The figure below consists of three diagnostic plots: 1) ‘Residuals vs. Fitted values’, 2) ‘Normal QQ-plot’ and 3) ‘Residuals vs. Leverage’.
par(mfrow = c(2, 2)) # to put the following figures in one 2x2 figure
plot(model2, which = c(1,2, 5))
In the ‘Residuals vs. Fitted’ plot the residuals are plotted against the fitted values of the response variable. The plot provides a graphical method to examine whether the errors have constant variance. A pattern in the scatter plot would indicate a problem with this assumption and imply that the size of the errors depends on the explanatory variables. Here the residuals seem to be reasonably randomly spread above and below the line, suggesting that the assumptions of linearity and constant variance hold.
The normal QQ-plot of the residuals helps to explore whether the errors are normally distributed. In the ideal case the residuals would match the diagonal line perfectly, and in practice the better the points follow the line, the stronger the evidence for the normality assumption. In the QQ-plot above the residuals fit the line reasonably well, although a slight curve is visible as the lower and upper tails deviate from the diagonal line.
The ‘Residuals vs. Leverage’ plot describes how large an influence a single observation has on the model and thus helps to identify whether some observations have an unusually strong impact on determining the regression line. Observations falling outside the Cook’s distance line (the dotted red line) are considered highly influential to the model fit (as they have a large residual and high leverage), meaning that the results would change considerably if the observation were removed from the data. In the figure above the Cook’s distance line does not even appear in the plot, indicating that none of the observations has an unusually large impact on the results.
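The same conclusion can be checked numerically; a sketch assuming model2 from above (a common rule of thumb flags observations with a Cook’s distance near or above 1):

```r
# largest Cook's distance among the observations (assumes model2 exists)
max(cooks.distance(model2))
```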
date()
## [1] "Thu Nov 25 11:49:18 2021"
My third week in short
# access to all packages needed
library(dplyr)
library(ggplot2)
library(GGally)
library(tidyr)
library(ggpubr)
library(boot)
# read the prepared data set from local file
alc <- read.table("data/alc.csv", sep = ";")
# examine the data
dim(alc)
## [1] 370 35
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
In this exercise the data is drawn from the Student Performance Data set, which includes information on student achievement in secondary education at two Portuguese schools. After data wrangling, the data set used here consists of 370 observations (rows) of 35 variables (columns), that is, information from 370 respondents on 35 variables. The names of the variables included in the prepared data set are shown above, and a more detailed description of them is presented in the description of the Student Performance Data.
After examining the variables in the data I chose the following four variables for further examination, as I assume they could be associated with the level of alcohol consumption: sex, romantic (whether the student is in a romantic relationship), studytime (weekly study time) and goout (going out with friends).
My hypotheses are: 1) males are more likely to be high users of alcohol than females; 2) being in a romantic relationship is associated with alcohol consumption; 3) the more time students spend studying, the lower their alcohol consumption; and 4) the more often students go out with friends, the higher their alcohol consumption.
Below I present the summary tables as well as plots of each variable of interest.
# before examining the variables I'll transform the two character variables of interest to factors
alc <- mutate(alc, sex = as.factor(sex), romantic = as.factor(romantic))
# pick the names of the variables of interest
varnames <- select(alc, sex, romantic, studytime, goout, high_use) %>%
colnames()
# summary of each variable
select(alc, varnames) %>%
summary()
## sex romantic studytime goout high_use
## F:195 no :251 Min. :1.000 Min. :1.000 Mode :logical
## M:175 yes:119 1st Qu.:1.000 1st Qu.:2.000 FALSE:259
## Median :2.000 Median :3.000 TRUE :111
## Mean :2.043 Mean :3.116
## 3rd Qu.:2.000 3rd Qu.:4.000
## Max. :4.000 Max. :5.000
# a plot of each variable
gather(alc[varnames]) %>% ggplot(aes(value)) +
facet_wrap("key", scales = "free") +
geom_bar()
Next, I examine how the selected variables are related to alcohol consumption. First, all the variables of interest are cross-tabulated with the ‘high use of alcohol’ variable, after which the figure shows the proportional distribution of high users for each variable.
# cross-tabulations with high_use:
# sex and high_use
addmargins(table(alc$sex, alc$high_use))
##
## FALSE TRUE Sum
## F 154 41 195
## M 105 70 175
## Sum 259 111 370
# romantic and high_use
addmargins(table(alc$romantic, alc$high_use))
##
## FALSE TRUE Sum
## no 173 78 251
## yes 86 33 119
## Sum 259 111 370
# high_use and studytime
addmargins(table(alc$high_use, alc$studytime))
##
## 1 2 3 4 Sum
## FALSE 56 128 52 23 259
## TRUE 42 57 8 4 111
## Sum 98 185 60 27 370
# high_use and goout
addmargins(table(alc$high_use, alc$goout))
##
## 1 2 3 4 5 Sum
## FALSE 19 82 97 40 21 259
## TRUE 3 15 23 38 32 111
## Sum 22 97 120 78 53 370
# proportion figures
t1 <- ggplot(data = alc, aes(x = sex, fill = high_use)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
ylab("%") +
xlab("Gender")
t2 <- ggplot(alc, aes(romantic, fill = high_use)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
ylab("%") +
xlab("Romantic relationship")
t3 <- ggplot(alc, aes(x = goout, fill = high_use)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
ylab("%") +
xlab("Go out with friends")
t4 <- ggplot(alc, aes(x = studytime, fill = high_use)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
ylab("%") +
xlab("Studytime")
# all plots in one figure ('ggarrange' is from 'ggpubr' package)
ggarrange(t1 + rremove("legend"), t2 + rremove("legend"), t3 + rremove("legend"), t4 + rremove("legend"),
ncol = 2, nrow = 2,
common.legend = TRUE, legend = "right")
The tables and figure above indicate that a bigger share of males than females are high users of alcohol. Similarly, the more the students go out with friends, or the less time they use for studying, the bigger the share of high users. In contrast, the share of high users of alcohol is roughly the same among those who are in a romantic relationship and those who are not. Thus, these descriptive results provide support for the above-presented hypotheses 1, 3 and 4, while hypothesis 2 is rather questionable.
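The shares discussed above can also be computed directly from the cross-tabulations; for example, assuming the alc data from the earlier chunks:

```r
# share of high users within each sex (row-wise proportions)
prop.table(table(alc$sex, alc$high_use), margin = 1)
# and within each level of going out with friends
prop.table(table(alc$goout, alc$high_use), margin = 1)
```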
# logistic regression model
## studytime is used as a factor in the model
model1 <- glm(high_use ~ sex + romantic + goout + as.factor(studytime), data = alc, family = "binomial")
# summary of the results
summary(model1)
##
## Call:
## glm(formula = high_use ~ sex + romantic + goout + as.factor(studytime),
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7282 -0.7953 -0.4800 0.8780 2.7249
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.1839 0.5126 -6.212 5.25e-10 ***
## sexM 0.7096 0.2667 2.660 0.00781 **
## romanticyes -0.1061 0.2747 -0.386 0.69922
## goout 0.7426 0.1198 6.202 5.59e-10 ***
## as.factor(studytime)2 -0.4045 0.2950 -1.371 0.17033
## as.factor(studytime)3 -1.1405 0.4726 -2.413 0.01581 *
## as.factor(studytime)4 -1.3334 0.6245 -2.135 0.03276 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 379.20 on 363 degrees of freedom
## AIC: 393.2
##
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(model1) %>%
exp
# compute confidence intervals (CI) for the odds ratios
CI <- confint(model1) %>%
exp
## Waiting for profiling to be done...
# combine the odds ratios and their confidence intervals
OR_with_CI <- cbind(OR, CI)
# round the results
round(OR_with_CI, digits = 2)
## OR 2.5 % 97.5 %
## (Intercept) 0.04 0.01 0.11
## sexM 2.03 1.21 3.45
## romanticyes 0.90 0.52 1.53
## goout 2.10 1.67 2.68
## as.factor(studytime)2 0.67 0.37 1.19
## as.factor(studytime)3 0.32 0.12 0.78
## as.factor(studytime)4 0.26 0.07 0.83
The results of the logistic regression model are presented above. The null deviance is the deviance of a model without any explanatory variables, and the residual deviance is the deviance when all the explanatory variables are taken into account. Higher values indicate a worse fit and smaller values a better fit. The difference between the null and residual deviance illustrates how good the fit of the given model with explanatory variables is; the greater the difference, the better. Although the residual deviance is still quite high, the difference shows that the model with explanatory variables fits better than a model including only the intercept term. The AIC (Akaike’s information criterion) can be used for comparing competing models (this is discussed more below).
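The drop in deviance can also be tested formally with a chi-squared (likelihood-ratio) test; this check uses only the values reported in the summary above (null deviance 452.04 on 369 df, residual deviance 379.20 on 363 df):

```r
# likelihood-ratio test of the model against the intercept-only model
dev_diff <- 452.04 - 379.20  # drop in deviance
df_diff  <- 369 - 363        # difference in degrees of freedom
pchisq(dev_diff, df = df_diff, lower.tail = FALSE)  # very small p-value
```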
The coefficients illustrate the associations between the explanatory variables and the target variable. The estimates can also be used, for instance, for calculating predicted probabilities of being a high user based on known values of the explanatory variables. According to the coefficient table, having a romantic relationship is not significantly associated with the use of alcohol. Instead, the results show that gender, time used for studying and going out with friends are statistically significant predictors of high alcohol consumption. However, not all the associations are linear, which is discussed along with the odds ratios.
The odds ratio (OR) illustrates the difference in the odds of being a high user between groups. For instance, in the case of the sex variable females are the reference group (OR = 1.00) and the odds of males are compared to the odds of females; the odds ratio shows that males (OR = 2.03) are about twice as likely to be high users as females. Going out with friends more often increases the likelihood of being a high user (OR = 2.10 per one-unit increase; note that the actual scale of the variable is somewhat obscure and thus difficult to interpret). Finally, compared to students who study less than two hours, those who study 5 to 10 hours (OR = 0.32) are about three times less likely to be high users, and those who study over 10 hours (OR = 0.26) are about four times less likely. Those who study 2-5 hours, instead, are not statistically more or less likely to be high users than the reference group (the coefficient is not statistically significant and the CI of the OR includes 1).
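The odds ratios are simply the exponentiated coefficients, and the “times less likely” readings are their inverses; a quick check with the estimates from the coefficient table above:

```r
# OR for sexM from its log-odds estimate (0.7096 in the coefficient table)
exp(0.7096)       # ~2.03
# inverse odds ratios for the studytime contrasts
1 / exp(-1.1405)  # ~3.1, "about three times less likely"
1 / exp(-1.3334)  # ~3.8, "about four times less likely"
```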
According to the model presented above, all variables except ‘romantic’ had a statistically significant relationship with high alcohol consumption. Next, I build a new regression model without the romantic variable and examine how accurate the model’s predictions are.
# new logistic regression model
model2 <- glm(high_use ~ sex + as.factor(studytime) + goout, data = alc, family = "binomial")
# predicted probabilities and prediction (> 0.5) of high_use and add it to the data.frame
probabilities <- predict(model2, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
# tabulate the observed high use versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 234 25
## TRUE 58 53
# probability table of high use vs. predictions
prob_table <- table(high_use = alc$high_use, prediction = alc$prediction) %>%
prop.table %>%
addmargins
round(prob_table * 100, digits = 2) # probs in % and rounded
## prediction
## high_use FALSE TRUE Sum
## FALSE 63.24 6.76 70.00
## TRUE 15.68 14.32 30.00
## Sum 78.92 21.08 100.00
ggplot(alc, aes(x = probability, y = high_use, col = prediction)) +
geom_point(alpha = 0.5, size = 3)
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# Calculate the share of false predictions (false positives + false negatives)
## this can be seen from the above prob.table as well
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2243243
The first table cross-tabulates the predictions against the actual values of high_use, and the second table shows the proportion of each cell. According to the proportion table, about 77.6 % of the predictions (cells FALSE/FALSE and TRUE/TRUE) match the actual values of the observations, while 22.4 % of the predictions (cells TRUE/FALSE and FALSE/TRUE) are incorrect. This is confirmed by the self-defined loss function, which gives the share of false predictions, 22.4 %. The plot should visualize the results, although it seems there is something wrong with the figure (I could not find an error in the code).
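The reported shares can be verified from the confusion-matrix counts alone:

```r
# counts from the cross-tabulation above
correct <- 234 + 53         # FALSE/FALSE + TRUE/TRUE
wrong   <- 58 + 25          # TRUE/FALSE + FALSE/TRUE
total   <- correct + wrong  # 370 observations
correct / total  # ~0.776, share of correct predictions
wrong / total    # ~0.224, matches the loss function
```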
# compute the average number of wrong predictions in the (training) data
## loss func is defined in previous chunk
nwrong_train <- loss_func(class = alc$high_use, prob = alc$probability)
# mean error in in the training data
nwrong_train
## [1] 0.2243243
# 10-fold cross-validation
crossvalid <- cv.glm(data = alc, cost = loss_func, glmfit = model2, K = 10)
# the mean prediction error for the testing data
crossvalid$delta[1]
## [1] 0.2486486
The mean error in the training data is 0.224. The mean prediction error for the testing data in the 10-fold cross-validation is around 0.25. The mean prediction error of the DataCamp model was about 0.26, suggesting that the model presented here has slightly better test performance; i.e. it is slightly more accurate in predicting high alcohol consumption.
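Because the folds are drawn at random, the cross-validation error varies a little between runs; a sketch for stabilizing the estimate by averaging over repeated runs, assuming alc, model2 and loss_func from the chunks above (and the boot package loaded earlier):

```r
# average the 10-fold CV error over 20 repeated runs
set.seed(2014)  # arbitrary seed, for reproducibility
cv_errors <- replicate(20, cv.glm(data = alc, cost = loss_func,
                                  glmfit = model2, K = 10)$delta[1])
mean(cv_errors)
```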
Here I utilize backward elimination with the AIC (Akaike’s information criterion) to select the explanatory variables for the model. The AIC takes into account both the statistical goodness of fit and the number of variables in the model by adding a penalty for a greater number of variables. In a series of competing models, lower AIC values are preferred; i.e. the aim is to achieve as low an AIC value as possible by excluding variables that make the value higher, one at a time. The elimination ends when removing more variables would not improve the AIC score. I start by selecting 10 variables and proceed with the elimination from there.
# logistic regression model
model3 <- glm(high_use ~ sex + age + Pstatus + absences + failures + schoolsup + as.factor(studytime) + goout + activities + freetime, data = alc, family = "binomial")
step(model3, direction = "backward")
## Start: AIC=386.87
## high_use ~ sex + age + Pstatus + absences + failures + schoolsup +
## as.factor(studytime) + goout + activities + freetime
##
## Df Deviance AIC
## - schoolsup 1 360.87 384.87
## - Pstatus 1 360.89 384.89
## - age 1 361.00 385.00
## - as.factor(studytime) 3 365.07 385.07
## - freetime 1 361.34 385.34
## <none> 360.87 386.87
## - failures 1 363.14 387.14
## - activities 1 363.79 387.79
## - sex 1 370.47 394.47
## - absences 1 372.93 396.93
## - goout 1 392.92 416.92
##
## Step: AIC=384.87
## high_use ~ sex + age + Pstatus + absences + failures + as.factor(studytime) +
## goout + activities + freetime
##
## Df Deviance AIC
## - Pstatus 1 360.89 382.89
## - age 1 361.00 383.00
## - as.factor(studytime) 3 365.08 383.08
## - freetime 1 361.34 383.34
## <none> 360.87 384.87
## - failures 1 363.15 385.15
## - activities 1 363.80 385.80
## - sex 1 370.75 392.75
## - absences 1 372.93 394.93
## - goout 1 392.92 414.92
##
## Step: AIC=382.89
## high_use ~ sex + age + absences + failures + as.factor(studytime) +
## goout + activities + freetime
##
## Df Deviance AIC
## - age 1 361.03 381.03
## - as.factor(studytime) 3 365.09 381.09
## - freetime 1 361.38 381.38
## <none> 360.89 382.89
## - failures 1 363.18 383.18
## - activities 1 363.80 383.80
## - sex 1 370.77 390.77
## - absences 1 372.99 392.99
## - goout 1 392.99 412.99
##
## Step: AIC=381.03
## high_use ~ sex + absences + failures + as.factor(studytime) +
## goout + activities + freetime
##
## Df Deviance AIC
## - as.factor(studytime) 3 365.16 379.16
## - freetime 1 361.50 379.50
## <none> 361.03 381.03
## - failures 1 363.56 381.56
## - activities 1 364.01 382.01
## - sex 1 371.03 389.03
## - absences 1 373.55 391.55
## - goout 1 394.24 412.24
##
## Step: AIC=379.16
## high_use ~ sex + absences + failures + goout + activities + freetime
##
## Df Deviance AIC
## - freetime 1 365.69 377.69
## <none> 365.16 379.16
## - failures 1 369.03 381.03
## - activities 1 369.14 381.14
## - absences 1 379.72 391.72
## - sex 1 380.24 392.24
## - goout 1 398.78 410.78
##
## Step: AIC=377.69
## high_use ~ sex + absences + failures + goout + activities
##
## Df Deviance AIC
## <none> 365.69 377.69
## - activities 1 369.50 379.50
## - failures 1 369.59 379.59
## - absences 1 380.01 390.01
## - sex 1 382.11 392.11
## - goout 1 404.79 414.79
##
## Call: glm(formula = high_use ~ sex + absences + failures + goout +
## activities, family = "binomial", data = alc)
##
## Coefficients:
## (Intercept) sexM absences failures goout
## -3.99781 1.05729 0.08345 0.45660 0.72059
## activitiesyes
## -0.51287
##
## Degrees of Freedom: 369 Total (i.e. Null); 364 Residual
## Null Deviance: 452
## Residual Deviance: 365.7 AIC: 377.7
The backward elimination suggests that we should keep five of the original 10 variables: sex, failures, activities, absences and goout. Interestingly, studytime was excluded! Next, I will run the logistic model with those variables and conduct a 10-fold cross-validation.
model4 <- glm(high_use ~ sex + failures + activities + absences + goout,
data = alc, family = "binomial")
summary(model4)
##
## Call:
## glm(formula = high_use ~ sex + failures + activities + absences +
## goout, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2685 -0.7612 -0.5209 0.7648 2.4960
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.99781 0.48451 -8.251 < 2e-16 ***
## sexM 1.05729 0.26683 3.962 7.42e-05 ***
## failures 0.45660 0.23337 1.957 0.05040 .
## activitiesyes -0.51287 0.26475 -1.937 0.05272 .
## absences 0.08345 0.02258 3.695 0.00022 ***
## goout 0.72059 0.12300 5.858 4.67e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 365.69 on 364 degrees of freedom
## AIC: 377.69
##
## Number of Fisher Scoring iterations: 4
# 10-fold cross-validation
crossvalid2 <- cv.glm(data = alc, cost = loss_func, glmfit = model4, K = 10)
# the mean prediction error for the testing data
crossvalid2$delta[1]
## [1] 0.1972973
# visual
probabilities_m4 <- predict(model4, type = "response")
alc <- mutate(alc, probability_m4 = probabilities_m4)
alc <- mutate(alc, prediction_m4 = probability_m4 > 0.5)
ggplot(alc, aes(x = probability_m4, y = high_use, col = prediction_m4)) +
geom_point(alpha = 0.5, size = 3)
The table above presents the summary of the fourth model. According to the 10-fold cross-validation, the mean prediction error for the testing data is around 0.20, suggesting that this model has better test performance than model 2 examined above. The figure provides graphical evidence for this. Moreover, the smaller residual deviance and the bigger difference between the null and residual deviance suggest that this new model is better than the previous one.
Thanks for reading and for the upcoming feedback!
date()
## [1] "Thu Nov 25 11:49:22 2021"
My fourth week in short
# access to packages
library(MASS)
library(dplyr)
library(tidyr)
library(ggplot2)
library(corrplot)
# get the data (included in MASS)
data(Boston)
# structure of the data
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The Boston data set utilized here includes ‘Housing Values in Suburbs of Boston’, that is, different variables relating to housing in the city of Boston. The data consists of 506 observations of 14 numeric (or integer) variables (i.e. 506 rows and 14 columns). A description of the variables can be found in the documentation of the data set.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# let's use gather() from the tidyr package; gather returns key-value pairs of variables
# draw a histogram of each variable
gather(Boston) %>%
ggplot(aes(value)) +
facet_wrap("key", scales = "free") +
geom_histogram()
# construct a correlation matrix and round the results
cor_matrix <-cor(Boston) %>%
round(digits = 2)
# visualize the correlation matrix
corrplot(cor_matrix, method = "circle", type = "upper", cl.pos = "r", tl.pos = "d", tl.cex = 0.8)
The summaries of the variables and the histograms presented above illustrate that many of the variables in the data set have quite skewed distributions and that the variables are measured on quite different scales.
The correlation matrix shows that many of the variables are highly correlated. The correlation coefficients vary from -1 to 1. A positive coefficient indicates that high values of variable ‘X’ are associated with high values of variable ‘Y’. Respectively, a negative coefficient indicates that high values of ‘X’ are associated with low values of ‘Y’.
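As a minimal illustration of the sign of the coefficient (using made-up vectors, not the Boston data):

```r
# toy vectors: y increases with x, z decreases with x
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)
z <- c(10, 8, 6, 4, 2)
cor(x, y) # 1: high values of x go with high values of y
cor(x, z) # -1: high values of x go with low values of z
```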
First I’ll scale the Boston data, that is, standardize each variable by subtracting its mean and dividing by its standard deviation, and save the standardized variables as a data frame instead of a matrix. After centering, the mean of each variable is zero, as can be seen from the summaries of the scaled variables shown below.
After scaling the data, I will create a new factor variable from the per capita crime rate variable (‘crim’), using its quantiles as cut points. A summary of the new ‘crime’ variable can be found below.
Finally, I will split the scaled data into training and testing data sets, which are then used in the next subchapter.
# scale the Boston data and transform the matrix to data.frame
boston_scaled <- Boston %>%
  scale() %>%
  as.data.frame()
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# pick the quantiles of crim
bins <- quantile(boston_scaled$crim)
# create label names
crim_lab <- c("low", "med_low", "med_high", "high")
# create new factor variable
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = crim_lab)
# add the new variable to the data frame and remove the old one
boston_scaled$crime <- crime
# NB: use the dplyr:: prefix below, otherwise an error may occur when knitting the index file, because another package (MASS) also defines select()!
boston_scaled <- dplyr::select(boston_scaled, -crim)
## alternative way: boston_scaled <- subset(boston_scaled, select = -crim)
# frequency table of the new variable
table(boston_scaled$crime)
##
## low med_low med_high high
## 127 126 126 127
# pick the number of total observations in the data
n <- nrow(boston_scaled)
# take a sample of 80% observations; i.e. pick randomly row numbers between 1 and 0.8 x rows
train_indexes <- sample(n, size = n * 0.8)
# create the training set by using the defined indexes for rows
train_set <- boston_scaled[train_indexes,]
# create the testing set by excluding the rows used for the training set
test_set <- boston_scaled[-train_indexes,]
Regarding this part of the exercise, there were multiple problems with the DataCamp exercises and the video was not available. In any case, let’s fit the LDA, using the ‘crime’ variable as the target variable and all the other variables as predictors. Below I show the results in a table and in a biplot. No interpretation was asked for in the instructions, so none is provided.
# fit the linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train_set)
# print the LDA results
lda.fit
## Call:
## lda(crime ~ ., data = train_set)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2524752 0.2673267 0.2400990 0.2400990
##
## Group means:
## zn indus chas nox rm age
## low 0.8820958 -0.9200942 -0.19513102 -0.8478258 0.44761627 -0.8561416
## med_low -0.1495827 -0.2656327 -0.05360128 -0.5463269 -0.16142707 -0.3324223
## med_high -0.3891090 0.1518098 0.21473488 0.3925735 0.02051922 0.3980438
## high -0.4872402 1.0172187 -0.15056308 1.0341139 -0.38636894 0.8155224
## dis rad tax ptratio black lstat
## low 0.8682244 -0.6891247 -0.7351723 -0.39381703 0.3755724 -0.76149283
## med_low 0.3077654 -0.5554496 -0.4603125 -0.09322631 0.3244670 -0.10037260
## med_high -0.3631561 -0.4597331 -0.3584517 -0.21660391 0.0973394 0.04914639
## high -0.8471459 1.6371072 1.5133254 0.77958792 -0.6831779 0.94495303
## medv
## low 0.513326282
## med_low -0.002157566
## med_high 0.076242944
## high -0.799873264
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.049486013 0.65511306 -0.97484049
## indus 0.101881230 -0.21572805 0.36337221
## chas -0.168419665 -0.15680650 0.04840405
## nox 0.276243540 -0.75327521 -1.50239868
## rm -0.094489913 -0.02632840 -0.18406758
## age 0.179455348 -0.30731359 -0.18472257
## dis -0.044963835 -0.18616611 0.04837891
## rad 3.727372902 0.89365941 -0.12932975
## tax -0.005777846 0.03142512 0.72642830
## ptratio 0.091156742 -0.01699341 -0.44792743
## black -0.096586342 0.09139483 0.20082253
## lstat 0.233271461 -0.23267700 0.35329598
## medv 0.170234430 -0.42102266 -0.17972763
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9565 0.0305 0.0130
# draw LDA biplot
classes <- as.numeric(train_set$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
Next, I will save the observed crime categories from the test data and then remove the variable, after which I test how well the fitted model predicts the classes. The results are shown below. The classifier, that is the fitted model, did not predict the crime rates perfectly, although the predictions are reasonably accurate. Most of the predictions match the observed classes, and most of those that do not fall into a neighboring class (i.e. the prediction is not right but close).
# save the crime variable separately
correct_cases <- test_set$crime
# remove crime variable from the test data
## MASS also has select(); use dplyr::select explicitly
test_set <- dplyr::select(test_set, -crime)
# use LDA model for predicting crime cases
lda.predict <- predict(lda.fit, newdata = test_set)
# cross tabulate the correct cases with the predictions
table(correct = correct_cases, predicted = lda.predict$class)
## predicted
## correct low med_low med_high high
## low 17 7 1 0
## med_low 4 12 2 0
## med_high 0 7 19 3
## high 0 0 0 30
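The overall accuracy behind this judgement can be computed from the cross tabulation as the share of observations on the diagonal. A sketch using the counts printed above (note that the exact numbers vary between knits, because the train/test split is random):

```r
# confusion matrix copied from the cross tabulation above
# (rows = observed classes, columns = predicted classes)
conf <- matrix(c(17,  7,  1,  0,
                  4, 12,  2,  0,
                  0,  7, 19,  3,
                  0,  0,  0, 30),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
# overall accuracy = correctly classified / all test observations
sum(diag(conf)) / sum(conf) # about 0.76
```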
Here, I reload the Boston data and recreate a data set in which the variables are standardized, so that the distances between observations are comparable. Next, I calculate the ‘euclidean’ distances between observations. A summary of the calculated distances is shown below.
# clean the environment (i.e. data.frames and variables etc.)
rm(list = ls())
# reload the Boston dataset
data(Boston)
# scale the Boston data and transform the matrix to data.frame
boston_scaled <- Boston %>%
  scale() %>%
  as.data.frame()
dist_euc <- dist(boston_scaled, method = "euclidean")
summary(dist_euc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
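For the record, the euclidean distance between two observations is simply the square root of the sum of squared differences over the variables; a minimal check with two made-up rows:

```r
# two made-up observations measured on three variables
a <- c(0, 0, 0)
b <- c(3, 4, 12)
# distance from the formula...
sqrt(sum((a - b)^2)) # 13
# ...and from dist(), as used for the Boston data above
dist(rbind(a, b), method = "euclidean")
```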
After that, K-means clustering is conducted: first with three centers, and then, after examining the total within-cluster sum of squares (WCSS), K-means clustering is run again with two centers, as suggested by the investigation of the WCSS. Below, the results of the two-center clustering are visualized in scatter plot matrices in which the clusters are colored; the first plot shows all of the variables (which is quite hard to read) and the following three plots show parts of the big picture. In all of them, two more or less distinct clusters are visible.
# k-means clustering with 3 centers
km_boston <- kmeans(boston_scaled, centers = 3)
# examine the WCSS to decide the optimal number of centers
# set seed
set.seed(123)
# determine the number of clusters in the examination
k_max <- 10
# calculate the total WCSS
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# the examination of the WCSS and its visualization suggest that two centers would be optimal
km_boston <- kmeans(boston_scaled, centers = 2)
# visualize the k-means clustering results
pairs(boston_scaled, col = km_boston$cluster)
pairs(boston_scaled[1:5], col = km_boston$cluster)
pairs(boston_scaled[6:10], col = km_boston$cluster)
pairs(boston_scaled[11:14], col = km_boston$cluster)
THAT’S ALL!
date()
## [1] "Thu Nov 25 11:49:32 2021"
library(dplyr)
library(ggplot2)
library(GGally)
library(corrplot)
library(tidyr)
The data set used here is drawn from the Human Development Index and the Gender Development Index. More information can be found on the United Nations Development Programme’s webpage. The prepared data set examined here consists of 155 observations (that is, countries) of 8 variables.
Description of the variables in the prepared data:
GNIperCap: gross national income per capita
LifeExp: life expectancy at birth
ExpEdu: expected years of schooling
MatMortalityRate: maternal mortality ratio
AdoBirthRate: adolescent birth rate
FemalesParliament: percentage share of seats in parliament held by women
LabourRatio: ratio of female to male labour force participation rates
SecondEduRatio: ratio of females to males with at least secondary education
The structure of the data and summaries of the variables are shown below.
# read the data from a local file
human <- read.table("data/human.csv", sep = ";")
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ GNIperCap : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ LifeExp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ ExpEdu : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ MatMortalityRate : int 4 6 6 5 6 7 9 28 11 8 ...
## $ AdoBirthRate : num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ FemalesParliament: num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
## $ LabourRatio : num 0.891 0.819 0.825 0.884 0.829 ...
## $ SecondEduRatio : num 1.007 0.997 0.983 0.989 0.969 ...
summary(human)
## GNIperCap LifeExp ExpEdu MatMortalityRate
## Min. : 581 Min. :49.00 Min. : 5.40 Min. : 1.0
## 1st Qu.: 4198 1st Qu.:66.30 1st Qu.:11.25 1st Qu.: 11.5
## Median : 12040 Median :74.20 Median :13.50 Median : 49.0
## Mean : 17628 Mean :71.65 Mean :13.18 Mean : 149.1
## 3rd Qu.: 24512 3rd Qu.:77.25 3rd Qu.:15.20 3rd Qu.: 190.0
## Max. :123124 Max. :83.50 Max. :20.20 Max. :1100.0
## AdoBirthRate FemalesParliament LabourRatio SecondEduRatio
## Min. : 0.60 Min. : 0.00 Min. :0.1857 Min. :0.1717
## 1st Qu.: 12.65 1st Qu.:12.40 1st Qu.:0.5984 1st Qu.:0.7264
## Median : 33.60 Median :19.30 Median :0.7535 Median :0.9375
## Mean : 47.16 Mean :20.91 Mean :0.7074 Mean :0.8529
## 3rd Qu.: 71.95 3rd Qu.:27.95 3rd Qu.:0.8535 3rd Qu.:0.9968
## Max. :204.80 Max. :57.50 Max. :1.0380 Max. :1.4967
A graphical overview of the data is presented below. The scatter plot matrix shows the distributions of the variables and illustrates their relationships. A better overview of the relationships between the variables is provided by the correlation matrix, which shows that many of the variables are highly correlated with each other. The correlation coefficients vary from -1 to 1. A positive coefficient indicates that high values of variable ‘X’ are associated with high values of variable ‘Y’. Respectively, a negative coefficient indicates that high values of ‘X’ are associated with low values of ‘Y’.
# scatter plot matrix using ggplot2 and GGally -packages
ggpairs(human, lower = list(combo = wrap("facethist", bins = 20)))
# correlation matrix
cor(human) %>%
  corrplot(method = "circle", type = "upper", cl.pos = "r", tl.cex = 0.8)
Next, I will perform principal component analysis (PCA) on the human data: first with the unstandardized data and then with the standardized data set.
# PCA with SVD method (unstandardized data)
pca_human_ustd <- prcomp(human)
# summary of PCA results
smry <- summary(pca_human_ustd)
smry
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
# rounded percentages of variance captured by each PC
pca_pr <- round(smry$importance[2, ] * 100, digits = 2)
# create object pc_lab (variance captured) to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# biplot of the results
biplot(pca_human_ustd,
       choices = 1:2,
       cex = c(0.7, 0.9),
       col = c("black", "red"),
       xlab = pc_lab[1], ylab = pc_lab[2])
The results of the PCA performed on the unstandardized data are shown above in the summary table and graphically in the biplot. They clearly show that the first principal component (misleadingly) captures almost all of the variance (99.99%), which is due to the use of unstandardized data. PCA is sensitive to the relative scaling of the variables and assumes that variables with larger variance are more important than those with smaller variance, which is not the case here; thus, the data must be standardized before performing PCA.
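This scale sensitivity is easy to demonstrate on any data whose variables are measured in different units. A sketch using the built-in mtcars data (an illustration only, not part of this analysis): without standardization the large-variance variable ‘disp’ dominates the first component, while with standardization the variance spreads across the components.

```r
# PCA on raw mtcars: 'disp' has by far the largest variance and dominates PC1
pca_raw <- prcomp(mtcars)
summary(pca_raw)$importance[2, 1] # proportion of variance on PC1: over 0.9
# PCA on standardized data (equivalent to prcomp(scale(mtcars)))
pca_std <- prcomp(mtcars, scale. = TRUE)
summary(pca_std)$importance[2, 1] # clearly lower; variance is shared
```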
The results from PCA of the standardized data are presented below. Now the first principal component captures about 54 percent of the variance and the second component about 16 percent. In the biplot the arrows can be interpreted as follows: the angle between two arrows reflects the correlation between the corresponding variables (a small angle indicates a strong positive correlation, opposite directions a negative correlation), the angle between an arrow and a principal component axis reflects the correlation between the variable and that component, and the length of an arrow is proportional to the standard deviation of the variable.
From the figure one can notice, for instance, that GNI per capita, life expectancy, expected education and the gender ratio in secondary education are all highly positively correlated with each other. Respectively, maternal mortality is positively correlated with the adolescent birth rate, and these two are negatively correlated with the variables mentioned in the previous sentence (e.g. maternal mortality is negatively correlated with life expectancy). Moreover, the figure illustrates that all of the above-mentioned features contribute to the first principal component. Instead, the variables FemalesParliament and LabourRatio contribute to the second principal component, and the vertical arrows illustrate that the share of women in parliament correlates positively with the labour force participation ratio between the genders.
# standardize the human data and save it as data frame
human_std <- as.data.frame(scale(human))
# PCA with SVD method (standardized data)
pca_human_std <- prcomp(human_std)
# summary of PCA results
smry_std <- summary(pca_human_std)
# rounded percentages of variance captured by each PC
pca_pr_std <- round(smry_std$importance[2, ] * 100, digits = 2)
# create object pc_lab_std (variance captured) to be used as axis labels
pc_lab_std <- paste0(names(pca_pr_std), " (", pca_pr_std, "%)")
# biplot of the results
biplot(pca_human_std,
       choices = 1:2,
       cex = c(0.5, 0.7),
       col = c("black", "red"),
       xlab = pc_lab_std[1], ylab = pc_lab_std[2])
The ‘tea’ data used here consists of 300 observations of 36 variables. I do not see any reason to visualize the whole data set, so instead I pick a few of the variables. A summary and visualization of the subset are shown below.
# load the data set from FactoMineR-package
library(FactoMineR)
data("tea")
# explore the data
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
# select a few columns and create a new data set
keep_vars <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- dplyr::select(tea, one_of(keep_vars))
# structure and summary of the new dataset
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# visualization of the new data set
gather(tea_time) %>%
  ggplot(aes(value)) +
  facet_wrap("key", scales = "free") +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
Next, I will perform Multiple Correspondence Analysis (MCA). A visualization of the MCA results is shown below.
# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)
# summary of the MCA model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
# plot the MCA results
plot(mca, invisible=c("ind"), habillage = "quali", graph.type = "classic")
In the figure the variable categories are drawn on the first two dimensions. The distance between categories provides a measure of their similarity. The vertical dimension relates, at least to some extent, to where the tea is bought. Respectively, the horizontal dimension seems to be related to how the tea is consumed (e.g. tea bags, unpackaged etc.). From the figure one can interpret, for example, that buying tea from a tea shop is closer to using unpackaged tea than to using tea bags, and similarly, buying tea from a chain store is closer to using tea bags than to using unpackaged tea.